Exploration of Loan data from OrangeJuice

About OrangeJuice

OrangeJuice is a fictitious company that offers loans as well as investment opportunities. The data for OrangeJuice is based on a real data set from an existing company; the name has been changed to OrangeJuice so that no judgment falls on the real organization. This report is purely an attempt to explore the trends hidden in the data set. Please keep in mind that although the names have been changed, the analysis uses real-world data, so the numbers are real.

Introduction

The dataset is from OrangeJuice.com and consists of thousands of loan records with 81 variables each. OrangeJuice.com is a platform for borrowing or investing in loans, and its concept is to make borrowing easy. The data set contains many variables because a loan has two sides: the investor who lends the money and the borrower who receives it. What I am looking at are the interest rates in the market for these loans. I want to find out how the interest rate is influenced by common indicators such as occupation or stated income. For my investigation, I take only the perspective of a borrower and want to find out what mechanics lie behind the interest rates.

Univariate Analysis

The first variable I look into is the interest rate. Two rates appear in the data: the annual percentage rate (BorrowerAPR) and the borrower’s interest rate (BorrowerRate). The main difference between the two is that the APR also takes into account additional costs such as legal fees and other hidden costs. The interest rate itself is the extra money you pay over a certain time period for your loan.

library('scales')
library('memisc')
library('lattice')
library('MASS')
library('car')
library('plyr')
library('reshape')
library('GGally')
library(gridExtra)
library('RColorBrewer')
library('ggplot2')
library(dplyr)
library(rworldmap)
library(ggmap)
library(maps)
library("ggrepel")
loans <- read.csv("OrangeJuiceLoanData.csv")
#names(loans)
loan_data <- loans %>% 
  subset(IncomeVerifiable == "TRUE") %>% 
  select(BorrowerAPR,BorrowerRate,LoanOriginalAmount,OrangeJuiceScore,LenderYield,EstimatedLoss,EstimatedEffectiveYield,Occupation,BorrowerState,EmploymentStatusDuration,EmploymentStatus,IsBorrowerHomeowner,CreditScoreRangeLower,CreditScoreRangeUpper,IncomeRange,StatedMonthlyIncome,Investors)
loan_data <- subset(loan_data, !is.na(BorrowerAPR))
loan_data$average_cs <- (loan_data$CreditScoreRangeUpper + loan_data$CreditScoreRangeLower) / 2
str(loan_data)
## 'data.frame':    105243 obs. of  18 variables:
##  $ BorrowerAPR             : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate            : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LoanOriginalAmount      : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ OrangeJuiceScore        : int  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ LenderYield             : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedLoss           : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedEffectiveYield : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ Occupation              : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ BorrowerState           : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ EmploymentStatusDuration: int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ EmploymentStatus        : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ IsBorrowerHomeowner     : logi  TRUE FALSE FALSE TRUE TRUE TRUE ...
##  $ CreditScoreRangeLower   : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper   : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ IncomeRange             : Factor w/ 9 levels "","$0","$1-24,999",..: 5 6 8 5 4 4 5 5 5 5 ...
##  $ StatedMonthlyIncome     : num  3083 6125 2083 2875 9583 ...
##  $ Investors               : int  258 1 41 158 20 1 1 1 1 1 ...
##  $ average_cs              : num  650 690 490 810 690 ...
all_states <- map_data("state")
state_list <- read.csv("states.csv")
colnames(state_list)[2] <-"BorrowerState"
loan_data2 <- left_join(loan_data,state_list, by = "BorrowerState")
loan_data2$State <- tolower(loan_data2$State)
colnames(loan_data2)[19] <- "region"

sample_loan <- select(loan_data2,BorrowerAPR,LoanOriginalAmount,region)
sample_loan$region_freq <- 1
sample_loan <- subset(sample_loan, !is.na(region))

map_loan <- sample_loan %>% 
  select(region,BorrowerAPR,LoanOriginalAmount,region_freq) %>% 
  group_by(region) %>% 
  summarise(averageAPR = mean(BorrowerAPR),averageLA = mean(LoanOriginalAmount),count = sum(region_freq))
map_loan_final <- left_join(all_states,map_loan, by ="region")

Borrower APR

ggplot(data = loan_data,aes(x = BorrowerAPR)) + geom_histogram(binwidth = 0.005, color = "tomato1",fill = "royalblue" ) 

The data seems to be well distributed. The bin width is 0.005, so the bins are very thin. Because the distribution is fairly even, we can move along without any major transformations to the variable. The plot shows the data centered around the mean, with a few rates that occur with noticeably high frequency.

Borrower Rate

ggplot(data = loan_data,aes(x = BorrowerRate)) + geom_histogram(binwidth = 0.005, color = "maroon1",fill = "seagreen2")

The borrower rate also seems to be well distributed. An interesting thing we can see in the graph is that it shares a similar if not the same mode as the Borrower APR. The two graphs (Borrower Rate and APR) have the same bin width 0.005.

set.seed(5555)
loan_sample <-  select(loan_data, -IncomeRange,-CreditScoreRangeLower,-CreditScoreRangeUpper,-IsBorrowerHomeowner,-BorrowerState,-Occupation)
loan_sample <- loan_sample[sample(1:length(loan_sample$BorrowerAPR), 1000),]
ggpairs(loan_sample)

The grid shows the relationships among the numerical variables in the data set. If you are looking for trends, the first thing you might notice is that some variables are highly correlated; for example, the interest rates and the yield seem to be very strongly correlated. This will be the starting point of my investigation.

LenderYield

ggplot(data = loan_data,aes(x = LenderYield)) + geom_histogram(binwidth = 0.005,fill = "greenyellow", color = "indianred")  

The lender yield by itself looks a lot like the APR and the borrower interest rate. Yield is usually understood as the return on a debt; here it is the return to the people lending out the money. Yield, APR, and borrower interest rate are highly correlated.

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
p1 <- ggplot(data = loan_data,aes(x = (BorrowerAPR))) +geom_histogram(binwidth = 0.005,fill = "white",color = "black") +
  geom_vline(xintercept = Mode(loan_data$BorrowerAPR),color = "red") + 
  geom_vline(xintercept = median(loan_data$BorrowerAPR),color = "green") +
  geom_vline(xintercept = mean(loan_data$BorrowerAPR),color = "blue") +
    scale_x_continuous(lim = c(0,0.4),breaks = seq(0,0.4,0.1))
p2 <- ggplot(data = loan_data,aes(x = (BorrowerRate))) +geom_histogram(binwidth = 0.005,fill = "white",color = "black") +
  geom_vline(xintercept = Mode(loan_data$BorrowerRate),color = "red") + 
  geom_vline(xintercept = mean(loan_data$BorrowerRate),color = "blue") +
  geom_vline(xintercept = median(loan_data$BorrowerRate),color = "green") +
  scale_x_continuous(lim = c(0,0.4),breaks = seq(0,0.4,0.1))
p3 <- ggplot(data = loan_data,aes(x = LenderYield)) +geom_histogram(binwidth = 0.005,fill = "white",color = "black") + 
  geom_vline(aes(xintercept = Mode(loan_data$LenderYield)),color = "red") + 
  geom_vline(aes(xintercept = mean(loan_data$LenderYield)),color = "blue") +
  geom_vline(aes(xintercept = median(loan_data$LenderYield)),color = "green") +
  scale_x_continuous(lim = c(0,0.4),breaks = seq(0,0.4,0.1))
grid.arrange(p1,p2,p3, ncol = 1)

All three plots are on the same scale. Comparing the three variables in the figure, they seem to have similar distributions. The red line marks the mode, the blue line the mean, and the green line the median. With these indicators, it can be seen that the three variables have similar central values, which differ by only marginal amounts.
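
To make the three indicators concrete, here is a minimal sketch that applies the report's `Mode` helper alongside `median()` and `mean()` to a small vector. The `toy_rates` values are made up for illustration and are not taken from the loan data.

```r
# Same helper as defined in the report: returns the most frequent value of x.
Mode <- function(x) {
  ux <- unique(x)                        # distinct values, in order of first appearance
  ux[which.max(tabulate(match(x, ux)))]  # count each distinct value, keep the most frequent
}

# Toy rates (hypothetical values, NOT from the OrangeJuice data)
toy_rates <- c(0.10, 0.25, 0.25, 0.15, 0.25, 0.15)

toy_mode   <- Mode(toy_rates)    # most frequent value: 0.25
toy_median <- median(toy_rates)  # middle of the sorted values: 0.20
toy_mean   <- mean(toy_rates)    # arithmetic average

print(c(mode = toy_mode, median = toy_median, mean = toy_mean))
```

Note that `which.max()` returns the first maximum, so when two values tie for the highest count, this helper reports whichever of them appears first in the data.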

Original Loan Amount

ggplot(data = loan_data,aes(x = LoanOriginalAmount)) + geom_histogram(binwidth = 1000,color = "red3",fill = "black")

The plot shows the loan amounts that were disbursed to individual loan seekers. The distribution is right-skewed: loans worth less than $10,000 are more frequent than loans worth more. The majority of OrangeJuice users could be people seeking small loans.

summary(loan_data$LoanOriginalAmount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8439   12000   35000

Looking at the descriptive statistics of the loan disbursement, the middle half of the loans lies between 4,000 and 12,000. The smallest loan was 1,000 and the largest was 35,000. The minimum is not far below the first quartile, but the largest loan is far above the third quartile. This could mean that large loans are rare for OrangeJuice. It is noteworthy, though, that loans of 25,000, although far from the central range, do occur with high frequency.
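
One way to quantify "far from the central range" is the common 1.5 × IQR rule of thumb. Applying that rule here is my assumption, not something the original analysis does; the sketch below only reuses the quartiles printed by `summary()` above.

```r
# Quartiles and maximum copied from the summary() output of LoanOriginalAmount
q1 <- 4000
q3 <- 12000
largest_loan <- 35000

iqr <- q3 - q1                 # interquartile range: 8000
upper_fence <- q3 + 1.5 * iqr  # 1.5 * IQR rule: 24000
lower_fence <- q1 - 1.5 * iqr  # -8000, below the 1000 minimum, so no low outliers

print(c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence))
print(largest_loan > upper_fence)  # TRUE
```

Under this rule, loans of 25,000 and above sit just beyond the upper fence of 24,000, which fits the observation that they are unusual relative to the bulk of the data even though they are frequent.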

ggplot(data = loan_data,aes(x = log(LoanOriginalAmount))) + geom_histogram(color = "slateblue", fill = "grey") 

The distribution looks roughly bimodal. There are three peaks, but two of them are clearly higher than the third, giving the shape of a bimodal distribution. The variable has been transformed with the natural log, which reduces the skewness seen in the untransformed plot. A log-transformed amount is also often easier to use in a linear regression, so any future regression analyses may find the transformation handy.
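
The effect of a log transformation on skewness can be sketched with synthetic data. The `skewness` helper and the log-normal `fake_amounts` below are assumptions made for illustration; they are not part of the original analysis and do not use the loan data.

```r
# Sample skewness: third standardized moment (close to 0 for a symmetric distribution)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(42)
# Synthetic right-skewed "loan amounts" (log-normal draws, NOT the OrangeJuice data)
fake_amounts <- rlnorm(5000, meanlog = 9, sdlog = 0.7)

skew_raw <- skewness(fake_amounts)       # clearly positive: long right tail
skew_log <- skewness(log(fake_amounts))  # near zero after the log transformation

print(c(raw = skew_raw, logged = skew_log))
```

The real loan amounts are not log-normal (they cluster at round figures), so the transformed histogram in the report keeps several peaks; the sketch only shows why the long right tail shrinks.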

Estimated Effective Yield and Estimated Loss

p4 <- ggplot(data = loan_sample,aes(x = EstimatedEffectiveYield)) +geom_histogram(color = "darkorchid", fill = "darkorange") 
p5 <- ggplot(data = loan_sample,aes(x = EstimatedLoss)) +geom_histogram(color = "darkorange", fill = "darkorchid")  
grid.arrange(p4,p5,ncol = 1)  

The estimated loss and estimated effective yield also seem to be fairly evenly distributed. Although the spread is even, some extreme values stretch the range: for yield the extremes are on the negative side, while for estimated loss they are on the positive side. The extreme values could be caused by some of the large loans that OrangeJuice lent out. Estimated effective yield depends on reinvestment in bonds, or in this case the loans. The estimates are mostly used to help investors decide on future loans. The negative yield values could be predictions for ongoing or future investments that may fail. Both variables are predictions, so they could influence consumer decisions and the loans themselves.

Occupation of a person/ State

 ggplot(data = loan_data,aes(x = Occupation)) + geom_bar(color = "navyblue", fill = "olivedrab") +coord_flip()

This is a discrete variable. The group “others” has the largest frequency. This is disappointing, as it limits the investigation of which occupations got the best loan rates and the highest amounts. Although the variable is not as informative as expected, whatever information we can extract can still be used for further analysis.

Employment Status and duration

p9 <- ggplot(data = subset(loan_data,EmploymentStatus != ""),aes(x = EmploymentStatus)) + geom_bar(color = "paleturquoise", fill = "palevioletred") 
p10 <- ggplot(data = loan_data,aes(x = EmploymentStatusDuration)) + geom_bar(fill = "paleturquoise", color = "palevioletred") 
grid.arrange(p9,p10,ncol = 1)

For employment status I had to filter out the blank entries, which means that about 2,000 or so records are missing from the plot. The employment status plot shows that most of the people receiving loans are employed. The duration graph did not have any filtering. The “EmploymentStatusDuration” data is skewed: loan disbursement is highest for people who stated a work duration of less than a year. This could be due to the short time OrangeJuice has been active, to users being young adults who recently started working, or to users being retired or not working.

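# geom_histogram() accepts binwidth; geom_bar() counts distinct values and has no binwidth parameter
ggplot(data = loan_data,aes(x = log2(EmploymentStatusDuration))) + geom_histogram(binwidth = 0.1, color = "black", fill = "yellow")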

What is the meaning of log base 2 of employment duration? Log base 2 of x answers the question: to which power n must 2 be raised to give x? This transformation of employment duration can be used in regression analysis. Besides regression, the figure shows that the log 2 values of the work duration are centered around six, so a typical employment duration is about 2^6 = 64 in the original units.
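
A short sketch of how `log2()` maps values and how to undo it. The `durations` vector is made up for illustration and is not taken from the data set.

```r
# log2(x) answers: "2 raised to which power gives x?"
log2(64)  # 6, because 2^6 = 64
2^6       # 64, undoing the transformation

# Hypothetical durations (NOT from the OrangeJuice data)
durations <- c(8, 32, 64, 128, 256)
logged <- log2(durations)  # 3 5 6 7 8
back   <- 2^logged         # recovers the original values

# A duration of zero has no finite log: log2(0) is -Inf
zero_case <- log2(0)

print(data.frame(durations, logged, back))
print(zero_case)
```

Each step of 1 on the log 2 axis corresponds to a doubling of the duration, which is what makes the axis in the histogram above readable.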

Income range and stated monthly income

p11 <- ggplot(data = loan_data,aes(x = IncomeRange)) + geom_bar(fill = "steelblue",color = "black") 
p12 <- ggplot(data = subset(loan_data,StatedMonthlyIncome < 100000),aes(x = StatedMonthlyIncome)) + geom_histogram(binwidth = 1000, fill = "black", color = "steelblue") 
grid.arrange(p11,p12,ncol = 1)

summary(loan_data$StatedMonthlyIncome)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3333    4750    5655    6846  483333

The graph is a subset of the total data, showing only stated incomes below 100,000. Amounts above that seem inaccurate, as they would be outrageous figures to state as a monthly income. The data had already been filtered to keep only borrowers whose income was verifiable. Looking at the quartiles, the majority of the data is below 10,000. So are the extreme incomes false? Should they be removed? It is difficult to judge whether the data is accurate, and there appear to be some data quality issues. For the investigation, I am filtering out these extreme values, but only for this variable, not for the whole data set.

p13 <- ggplot(data = loan_data,aes(x = log(StatedMonthlyIncome))) + geom_histogram(binwidth = 0.1,fill = "deeppink1") 
p14 <- ggplot(data = loan_data,aes(x = log2(StatedMonthlyIncome))) + geom_histogram(binwidth = 0.1,fill = "deeppink")
p15 <- ggplot(data = loan_data,aes(x = log10(StatedMonthlyIncome))) + geom_histogram(binwidth = 0.01, fill = "deeppink3")
grid.arrange(p13,p14,p15,ncol = 2)

These are transformations of the variable into log, log 2 and log 10 forms.
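
The three logarithms differ only by a constant factor (the change-of-base rule), so they stretch the axis differently but cannot change the shape of the distribution. A minimal sketch, using a few illustrative income values (three of them are the quartiles from the summary above, the other two are made up):

```r
# Illustrative monthly incomes; 3333, 4750 and 6846 are the quartiles from summary()
income <- c(1000, 3333, 4750, 6846, 10000)

ln  <- log(income)    # natural log
l2  <- log2(income)   # log base 2
l10 <- log10(income)  # log base 10

# Change of base: log_b(x) = log(x) / log(b), so each ratio is a constant
ratio_2  <- l2 / ln   # always 1 / log(2)
ratio_10 <- l10 / ln  # always 1 / log(10)

print(data.frame(income, ln, l2, l10))
print(c(one_over_log2 = 1 / log(2), one_over_log10 = 1 / log(10)))
```

For a regression this means the choice of base changes only the scale of the coefficient, not the fit, so the base can be picked for readability.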

OrangeJuiceScore and Credit score

p16 <- ggplot(data = loan_data,aes(x = OrangeJuiceScore)) + geom_bar(color = "firebrick1", fill = "goldenrod3") 
p17 <- ggplot(data = loan_data,aes(x = average_cs)) + geom_bar(fill = "firebrick1", color = "goldenrod3") 
grid.arrange(p16,p17,ncol =1)

The OrangeJuice score and average credit score seem well distributed. The average credit score is an average of a lower bound and an upper bound of credit score. Both variables seem to be discrete but could be manipulated into a continuous form.

Geographical Distribution (States)

cnames <- aggregate(cbind(long, lat) ~ region, data=map_loan_final, 
                    FUN=function(x)median(range(x)))

ggplot() + geom_polygon(data = map_loan_final, aes(x = long, y = lat, group = group, fill = count), colour = "white") +
  scale_fill_continuous(low = "peachpuff1", high = "black", guide="colorbar") +
  theme_bw()  +
  labs(title = "Count of loan disbursement based on OrangeJuice data across the US",fill = "count") +
  scale_y_continuous(breaks=c()) + 
  scale_x_continuous(breaks=c()) + 
  theme(panel.border =  element_blank()) + geom_text_repel(data=cnames, aes(long, lat, label = region), size=4, color = "red") 

The map shows how the loans are spread across the states. California seems to have the most loan disbursements, which may be because OrangeJuice is a company based in California. Besides California, New York, Texas, and Florida also have a lot of OrangeJuice users. The fewest users were in North Dakota; Maine, Wyoming, and South Dakota were also states with few users. OrangeJuice has a noticeable presence in a little more than half of the United States.

summary(map_loan$count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      50     420    1005    1959    2684   13504

Homeownership

Homeownership is a variable that can give insights into loan disbursement. Home ownership is important for big loans as the home can be used as collateral.

summary(loan_data$IsBorrowerHomeowner)
##    Mode   FALSE    TRUE 
## logical   51170   54073

OrangeJuice advertises that its loans are collateral free. The table above gives the counts of homeowners and non-homeowners, which are almost equal. With the information we have so far, it would be hard to conclude whether owning a home helps you get a loan. This question will be looked at more carefully later in the investigation.

Investors

ggplot(data =loan_data,aes(x = Investors)) + geom_histogram(binwidth = 0.1) +scale_x_continuous(lim = c(0,10),breaks = seq(0,10,1))

Having just one investor is the most common case, which makes the distribution heavily right-skewed. More relationships can be derived using Investors as a variable.

Bivariate Analysis

cor_data <- select(loan_data,-Occupation,-BorrowerState,-EmploymentStatus,-IsBorrowerHomeowner,-CreditScoreRangeLower,-CreditScoreRangeUpper,-IncomeRange)
cor_data <- subset(cor_data,!is.na(OrangeJuiceScore))
cor_data <- subset(cor_data,!is.na(EmploymentStatusDuration))
cor_result <- cor(cor_data, method = "pearson")
data.frame(cor_result)
##                          BorrowerAPR BorrowerRate LoanOriginalAmount
## BorrowerAPR                1.0000000   0.99344268        -0.41822571
## BorrowerRate               0.9934427   1.00000000        -0.40547431
## LoanOriginalAmount        -0.4182257  -0.40547431         1.00000000
## OrangeJuiceScore          -0.6746714  -0.65673358         0.26445293
## LenderYield                0.9934436   0.99999575        -0.40543073
## EstimatedLoss              0.9528574   0.94883823        -0.42282653
## EstimatedEffectiveYield    0.9021611   0.90162603        -0.32565609
## EmploymentStatusDuration  -0.0236893  -0.02352023         0.06730129
## StatedMonthlyIncome       -0.1568695  -0.15495373         0.30022734
## Investors                 -0.2705047  -0.24735310         0.31963067
## average_cs                -0.5468156  -0.52854073         0.28576478
##                          OrangeJuiceScore LenderYield EstimatedLoss
## BorrowerAPR                  -0.674671374  0.99344360    0.95285741
## BorrowerRate                 -0.656733576  0.99999575    0.94883823
## LoanOriginalAmount            0.264452932 -0.40543073   -0.42282653
## OrangeJuiceScore              1.000000000 -0.65677965   -0.68022357
## LenderYield                  -0.656779653  1.00000000    0.94884888
## EstimatedLoss                -0.680223567  0.94884888    1.00000000
## EstimatedEffectiveYield      -0.642816341  0.90168465    0.81496056
## EmploymentStatusDuration     -0.008279797 -0.02351152   -0.02534737
## StatedMonthlyIncome           0.156722217 -0.15495087   -0.15257092
## Investors                     0.322933531 -0.24739903   -0.28129276
## average_cs                    0.387073964 -0.52855470   -0.53230202
##                          EstimatedEffectiveYield EmploymentStatusDuration
## BorrowerAPR                           0.90216113             -0.023689302
## BorrowerRate                          0.90162603             -0.023520232
## LoanOriginalAmount                   -0.32565609              0.067301290
## OrangeJuiceScore                     -0.64281634             -0.008279797
## LenderYield                           0.90168465             -0.023511521
## EstimatedLoss                         0.81496056             -0.025347368
## EstimatedEffectiveYield               1.00000000             -0.010999691
## EmploymentStatusDuration             -0.01099969              1.000000000
## StatedMonthlyIncome                  -0.13428710              0.064872948
## Investors                            -0.27288032             -0.019541931
## average_cs                           -0.46941774              0.025787264
##                          StatedMonthlyIncome   Investors  average_cs
## BorrowerAPR                      -0.15686951 -0.27050473 -0.54681559
## BorrowerRate                     -0.15495373 -0.24735310 -0.52854073
## LoanOriginalAmount                0.30022734  0.31963067  0.28576478
## OrangeJuiceScore                  0.15672222  0.32293353  0.38707396
## LenderYield                      -0.15495087 -0.24739903 -0.52855470
## EstimatedLoss                    -0.15257092 -0.28129276 -0.53230202
## EstimatedEffectiveYield          -0.13428710 -0.27288032 -0.46941774
## EmploymentStatusDuration          0.06487295 -0.01954193  0.02578726
## StatedMonthlyIncome               1.00000000  0.12276229  0.10550350
## Investors                         0.12276229  1.00000000  0.36063686
## average_cs                        0.10550350  0.36063686  1.00000000

All numeric variables were put through the Pearson correlation test. The coefficient lies on a scale of -1 to 1. The closer it is to positive or negative one, the stronger the linear relationship; a value close to 0 indicates a weak relationship. I treat anything between -0.2 and 0.2 as weak. Among the coefficients, “BorrowerRate” and “LenderYield” have a very high correlation. The other variables show a range of correlations: most are larger than 0.2 in absolute value, but some, such as employment duration, have very weak Pearson scores.
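
To show what the coefficient measures, the sketch below computes Pearson's r by hand and compares it with `cor()`. The `score` and `apr` vectors are made-up values chosen to mimic a falling rate with a rising score; they are not rows from the loan data.

```r
# Small made-up sample (NOT from the OrangeJuice data)
score <- c(2, 4, 5, 7, 9, 10)
apr   <- c(0.30, 0.27, 0.24, 0.20, 0.15, 0.12)

# Pearson's r: sum of cross-deviations divided by the square root of the
# product of the two sums of squared deviations
dev_score <- score - mean(score)
dev_apr   <- apr - mean(apr)
pearson_manual <- sum(dev_score * dev_apr) /
  sqrt(sum(dev_score^2) * sum(dev_apr^2))

# Built-in equivalent, as used for the table above
pearson_builtin <- cor(score, apr, method = "pearson")

print(c(manual = pearson_manual, builtin = pearson_builtin))
```

Because the formula divides by the spread of both variables, the result has no units, which is why rates, dollar amounts, and scores can share one table.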

APR vs RATE

ggplot(data = loan_data,aes(x = BorrowerAPR,y = BorrowerRate))+geom_point(alpha = 0.2, color = "pink") + geom_smooth(method = 'lm') 

These variables have a very high correlation score. This could mean a couple of things for the analysis. In terms of pure relation, the two are near-perfect predictors of each other. If a regression analysis were to be conducted, these variables would need to be tested for collinearity, since the absence of strong multicollinearity among the predictors is one of the assumptions of linear regression.

APR vs Yield

ggplot(data = loan_data,aes(x = BorrowerAPR,y = LenderYield))+geom_point(alpha = 0.2, color = "pink") + geom_smooth(method = 'lm') 

LenderYield and BorrowerAPR also seem to be very strongly correlated. Such high correlation could make it hard to separate cause from effect. Looking at the plot, LenderYield and BorrowerAPR seem to have an almost perfect positive relationship.

Yield vs Rate

ggplot(data = loan_data,aes(x = BorrowerRate,y = LenderYield))+geom_point(alpha = 0.2, color = "orange") + geom_smooth(method = 'lm') 

Multicollinearity concerns the independent variables in a linear regression. The usefulness of a regression model depends on whether the variables satisfy its assumptions. If LenderYield and BorrowerRate are tested and found highly collinear, one of them might need to be excluded from the regression model. The two also have very small error terms, which is why the scatter plot closely follows the fitted line.
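
One common collinearity test is the variance inflation factor (VIF). The sketch below computes it by hand on synthetic predictors; the data, the `vif_manual` helper, and the choice of VIF as the test are all assumptions for illustration and are not part of the original analysis. The `car` package loaded at the top of the report provides `vif()` for fitted models, which would be the usual route on the real data.

```r
set.seed(1)
n <- 500

# Synthetic predictors (NOT the loan data): 'yield' is 'rate' minus a small, noisy fee
rate   <- runif(n, 0.05, 0.35)
yield  <- rate - 0.01 + rnorm(n, sd = 0.001)
income <- rnorm(n, mean = 5000, sd = 1500)  # unrelated to the other two

# VIF of a predictor = 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on all the other predictors
vif_manual <- function(target, others) {
  d  <- data.frame(target = target, others)
  r2 <- summary(lm(target ~ ., data = d))$r.squared
  1 / (1 - r2)
}

vif_rate   <- vif_manual(rate, data.frame(yield, income))
vif_income <- vif_manual(income, data.frame(rate, yield))

print(c(rate = vif_rate, income = vif_income))
```

A VIF near 1 means the predictor is not explained by the others; values above roughly 5 or 10 are often read as a sign of problematic collinearity, though those cut-offs are rules of thumb rather than fixed limits.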

OrangeJuiceScore vs APR

ggplot(data = loan_data,aes(y = BorrowerAPR,x = factor(OrangeJuiceScore)))+geom_boxplot() 

The OrangeJuice score is treated as a factor for the plot. The median BorrowerAPR for each OrangeJuiceScore tends to decrease as the score gets higher. A few loans received good rates with bad OrangeJuiceScores and vice versa. Apart from these oddities, it seems that a good OrangeJuice score can get you a good interest rate.

Loan Original Amount vs OrangeJuiceScore

ggplot(data = loan_data,aes(y = LoanOriginalAmount,x = factor(OrangeJuiceScore)))+geom_boxplot() 

Comparing the score with loan amounts shows a similar trend, where better scores give you larger amounts of money. The only interesting point is that an OrangeJuice score of 8 has higher loan disbursements than a score of 9.

APR vs credit score

ggplot(data = subset(loan_data, !is.na(average_cs)),aes(y = BorrowerAPR,x = average_cs))+geom_smooth()
## `geom_smooth()` using method = 'gam'

The relation between credit score and BorrowerAPR is negative overall: higher credit scores go with lower rates, in line with the correlation of about -0.55 in the table above. There is no movement of data for credit scores less than 500. This could be because credit scores less than 500 are not a part of the OrangeJuice database.

Loan amount vs credit score

ggplot(data = subset(loan_data, !is.na(average_cs)),aes(y = LoanOriginalAmount,x = average_cs))+geom_smooth()
## `geom_smooth()` using method = 'gam'

The relation between average credit score and loan disbursement is positive. There is no movement of data for credit scores less than 500. This could be because most credit scores less than 500 are not a part of the OrangeJuice database.

Credit score vs OrangeJuice score

ggplot(data = loan_data,aes(x =OrangeJuiceScore,y = average_cs))+geom_smooth()
## `geom_smooth()` using method = 'gam'

The OrangeJuiceScore is supposed to work like the credit score, and the two seem to move together. Note that most credit scores are above 500, while loans that have an OrangeJuice score have credit scores only as low as 660.

APR vs Estimated Loss

ggplot(data = loan_data,aes(y = BorrowerAPR,x = EstimatedLoss))+geom_jitter(alpha = 0.5, color = "purple") + geom_smooth()
## `geom_smooth()` using method = 'gam'

As there is an increase in loss estimates, there seems to be a corresponding increase in the BorrowerAPR for the loan. This could be an attempt to get the most out of a loan. There are other variables in the original data set which affect these two variables. One of the variables is estimated returns, which I have not included in this report but would be key in future analyses of these variables.

APR vs Estimated Effective yield

ggplot(data = loan_data,aes(y = BorrowerAPR,x = EstimatedEffectiveYield))+geom_jitter(alpha = 0.2, color = "purple") + geom_smooth(method = "lm") + coord_flip()

Estimated effective yield and BorrowerAPR seem to share a positive linear relationship. There are also straight purple bands running parallel to the fitted line on the plot. This could be a result of high demand or supply for particular rates.

Stated income vs APR

ggplot(data = subset(loan_data, StatedMonthlyIncome < 50000),aes(x = BorrowerAPR,y = StatedMonthlyIncome))+geom_jitter(alpha = 0.5, color = "purple") + geom_smooth()
## `geom_smooth()` using method = 'gam'

The data has been subsetted to remove the extreme income values. The majority of the data lies below $20,000, which still seems very high for a stated monthly income. There could have been a mix-up while collecting the data, where people entered yearly incomes as monthly incomes. The trend line suggests that having a larger income slightly reduces your interest rate.

Stated Income vs Loan Amount

ggplot(data = subset(loan_data, StatedMonthlyIncome < 50000),aes(x = LoanOriginalAmount,y = StatedMonthlyIncome))+geom_jitter(alpha = 0.2, color = "grey") + geom_smooth(color = "red",linetype = 2)
## `geom_smooth()` using method = 'gam'

The stated monthly income has a stronger relationship with the original loan amount. Having a larger income appears to enable you to get a larger loan.

APR vs Loan amount

ggplot(data = loan_data,aes(x = BorrowerAPR,y = LoanOriginalAmount))+geom_jitter(alpha = 0.1, color = "Maroon")

The correlation test on the subsetted data gives a correlation of -0.41. Without subsetting, the figure is -0.3, which I believe could be because of extreme values. A result of -0.42 appears when we filter out all NA values for OrangeJuice scores, which leaves a subset of about 70,000 entries.

BorrowerAPR vs Investors

np1 <- ggplot(data = loan_data,aes(y = BorrowerAPR,x = Investors))+geom_point(color ="darkgreen", alpha = 0.01)
np2 <- ggplot(data = loan_data,aes(y = BorrowerAPR,x = Investors))+geom_point(color ="darkorange")
grid.arrange(np2,np1,ncol = 2)

There is a negative relation between Borrower APR and the number of investors on the loan. The orange plot shows that loans come with widely varying numbers of investors. The green plot shows the density of the number of investors on the loans. Loans with fewer than 100 investors are the most frequent.

Investors vs Loan Amount

np3 <- ggplot(data = loan_data,aes(y = LoanOriginalAmount,x = Investors))+geom_point(color = "pink")
np4 <- ggplot(data = loan_data,aes(y = LoanOriginalAmount,x = Investors))+geom_point(color ="tomato", alpha = 0.01)
grid.arrange(np3,np4,ncol = 2)

Investors and loan amount have a positive relationship: a higher number of investors is associated with larger loan amounts. The pink plot has a gap between 25,000 and 35,000, which could be because few loans over 25,000 are disbursed. The red plot shows the density of loan amounts and investors. The majority of the loan activity takes place below 10,000, and most investors are investing in those loans. Besides values under 10,000, the amounts 15,000, 20,000, and 25,000 also have a lot of investors.

Employment Status Duration vs APR

ggplot(data = loan_data,aes(y = BorrowerAPR,x = EmploymentStatusDuration))+geom_jitter(alpha = 0.1, color = "Maroon") +geom_smooth()
## `geom_smooth()` using method = 'gam'

There is no clear relationship between the stated employment duration and the interest rate; the correlation in the table above is only about -0.02. Employment duration therefore appears to be a non-factor among the OrangeJuice variables, and dropping it would be my next step, as it does not share any concrete relation with any other variable.

Average loan amounts geographically and average rates geographically

ggplot() + geom_polygon(data = map_loan_final, aes(x = long, y = lat, group = group, fill = averageAPR), colour = "white") +
scale_fill_continuous(low = "yellow", high = "darkred", guide="colorbar") +
theme_bw()  +
labs(title = "Average loan rates based on OrangeJuice data across the US",fill = "Average APR") +
scale_y_continuous(breaks=c()) + 
scale_x_continuous(breaks=c()) + 
theme(panel.border =  element_blank()) + geom_text_repel(data=cnames, aes(long, lat, label = region), size=4, color = "orange") 

Looking at interest rates across the US, a lot of the states have high average APRs on their loans.

summary(map_loan$averageAPR)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1738  0.2137  0.2172  0.2174  0.2221  0.2362

The lowest average rates can be seen in Maine and Iowa. Other states with low average rates include Alaska (not on the map) and the District of Columbia. Interest rates may be influenced by geographical and demographic factors. These could be a source of error in the current data set and possibly in the larger data set.

ggplot() + geom_polygon(data = map_loan_final, aes(x = long, y = lat, group = group, fill = averageLA), colour = "white") +
  scale_fill_continuous(low = "turquoise", high = "royalblue4", guide="colorbar") +
  theme_bw()  +
  labs(title = "Average loan disbursement based on OrangeJuice data across the US",fill = "Average disbursed amount") +
  scale_y_continuous(breaks=c()) + 
  scale_x_continuous(breaks=c()) + 
  theme(panel.border =  element_blank()) + geom_text_repel(data=cnames, aes(long, lat, label = region), size=4,color = "gold") 

Average loan amounts were highest in the District of Columbia. The Northeast has many states with high average loan amounts. Alaska (not on the map) also received high average loan amounts.

summary(map_loan$averageLA)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4154    7979    8381    8290    8880   10251

The lowest amounts received were in North Dakota, Maine, and Iowa.

Multivariate Analysis

ggplot(data = loan_data, aes(y = BorrowerAPR, x = LenderYield)) + geom_jitter(alpha = 0.1, aes(color = BorrowerRate))

There seems to be a common relationship between all three variables: they increase together and take values close to one another. The APR is calculated from the interest rate, and the yield, which is the lender's earning on the loan, is highly correlated with both rates. Because of this high correlation, the usefulness of including all three variables has to be reassessed.
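A correlation matrix is one way to check how closely the three variables move together. The sketch below uses a small made-up data frame (`toy_loans`, hypothetical) in place of `loan_data`, so the numbers it prints are illustrative and are not results from the OrangeJuice data.

```r
# Hypothetical toy data; the real check would use the same three columns of loan_data.
toy_loans <- data.frame(
  BorrowerRate = c(0.10, 0.15, 0.20, 0.25, 0.30),
  BorrowerAPR  = c(0.12, 0.17, 0.23, 0.28, 0.33),
  LenderYield  = c(0.09, 0.14, 0.19, 0.24, 0.29)
)

# Pairwise Pearson correlations; use = "complete.obs" drops rows containing NA.
rate_cor <- cor(toy_loans, use = "complete.obs")
print(round(rate_cor, 3))
```

If the off-diagonal values on the real data are close to 1, keeping only one of the three variables in a later model would avoid feeding it nearly duplicate predictors.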

ggplot(data = loan_data, aes(y = BorrowerAPR, x = LoanOriginalAmount)) + geom_jitter(alpha = 0.78, aes(color = average_cs)) + facet_wrap(~IsBorrowerHomeowner) + scale_color_gradient(low = "yellow", high = "#6C3483", space = "Lab", guide = "colourbar") + geom_smooth()
## `geom_smooth()` using method = 'gam'

Now we look at three factors (credit score, loan amount, and APR) in two scenarios: one where the borrower is a homeowner and one where they are not. Overall, being a homeowner does not make much of a difference. We can also see a negative relationship between interest rate and loan amount. It is not a strong relationship, but it does let us make assumptions about how to get lower rates. There is a mix of good and bad credit scores at both low and high rates. The one visible difference between the panels is a higher concentration of loans greater than 25,000 among homeowners.

ggplot(data = loan_data, aes(y = BorrowerAPR, x = OrangeJuiceScore)) + geom_point(aes(color = average_cs, size = LoanOriginalAmount), alpha = 0.6) + scale_color_gradient(low = "tomato", high = "black", space = "Lab", guide = "colourbar") + geom_smooth()
## `geom_smooth()` using method = 'gam'

The plot shows how the OrangeJuice score plays an important role in determining the loan amount and interest rate.

ggplot(data = loan_data, aes(y = BorrowerAPR, x = OrangeJuiceScore)) + geom_point(alpha = 1, aes(color = average_cs, size = LoanOriginalAmount)) + facet_wrap(~EmploymentStatus) + scale_color_gradient(low = "turquoise1", high = "springgreen4", space = "Lab", guide = "colourbar") + geom_smooth()
## `geom_smooth()` using method = 'gam'

Looking at the OrangeJuice score, we see an inverse relationship: good OrangeJuice scores allow for lower interest rates. The size of each bubble is the loan amount disbursed. It can be seen that retired, unemployed, and part-time borrowers are not receiving loans. Surprisingly, people who have listed their employment status as ‘others’ have received loans.

ggplot(data = loan_data, aes(y = BorrowerAPR, x = Occupation)) + geom_point(aes(color = LoanOriginalAmount), size = 2.3) + scale_color_gradient(low = "yellow", high = "red4", space = "Lab", guide = "colourbar") + coord_flip() + geom_boxplot(alpha = 0.1)

When we look at the different occupations in relation to rates and disbursement amounts, students and investors seem to have low interest rates. Loan disbursement amounts seem to be higher in professions with higher pay.

est <- subset(loan_data,IncomeRange != "Not displayed" & IncomeRange != "Not employed" & IncomeRange != "$0")
est$IncomeRange <- factor(est$IncomeRange, levels = c("$1-24,999","$25,000-49,999","$50,000-74,999","$75,000-99,999","$100,000+"))
ggplot(data = subset(est,!is.na(IncomeRange)), aes(y = BorrowerAPR, x = LoanOriginalAmount)) + geom_point(aes(color = IncomeRange)) + scale_color_brewer(type = "div")

Looking at the graph, you can see that income range has an effect on the loan amount and interest rate. For example, borrowers in a lower income bracket generally receive smaller loans, and because loan amount and rate have an inverse relationship, lower incomes end up with higher interest rates.

est <- subset(loan_data,IncomeRange != "Not displayed" & IncomeRange != "Not employed" & IncomeRange != "$0")
est$IncomeRange <- factor(est$IncomeRange, levels = c("$1-24,999","$25,000-49,999","$50,000-74,999","$75,000-99,999","$100,000+"))
ggplot(data = subset(est,!is.na(IncomeRange)), aes(y = BorrowerAPR, x = LoanOriginalAmount)) + geom_jitter(aes(color = IncomeRange),alpha = 0.88) + scale_color_brewer(type = "qual") + facet_wrap(~IsBorrowerHomeowner)

It can also be seen that most people earning 100,000 or more are homeowners as well. They have also received larger loan amounts. Could it be argued that you stand a better chance of receiving a larger loan if you have more collateral (a question asked in general, not based on the OrangeJuice data)?

ggplot(data = loan_data, aes(y = BorrowerAPR, x = EstimatedEffectiveYield)) +  geom_jitter(aes(color = EstimatedLoss, size = LoanOriginalAmount),alpha = 0.5 ) 

When looking at estimated effective yield and APR, you see a positive relationship over most of the range. The yield also extends into negative values. As yield rises from negative values toward zero, APR decreases. When yield is positive and increasing, so is APR, so over that range the two move together. The estimated loss seems to be higher where yield is negative and lowest where yield and APR are lowest. There are a few large loans where yield is negative and loss is high.

est <- subset(loan_data,IncomeRange != "Not displayed" & IncomeRange != "Not employed" & IncomeRange != "$0")
est$IncomeRange <- factor(est$IncomeRange, levels = c("$1-24,999","$25,000-49,999","$50,000-74,999","$75,000-99,999","$100,000+"))
ggplot(data = subset(est,!is.na(IncomeRange)), aes(x = EstimatedEffectiveYield, y = EstimatedLoss)) + geom_point(aes(color = factor(OrangeJuiceScore)),alpha = 0.88) + scale_color_brewer(type = "div") + facet_wrap(~IncomeRange) 

It can be seen that low OrangeJuice scores are predicted to bring losses, so borrowers with those scores are given high interest rates. Good scores get good rates but do not necessarily bring high yields.

est <- subset(loan_data,IncomeRange != "Not displayed" & IncomeRange != "Not employed" & IncomeRange != "$0")
est$IncomeRange <- factor(est$IncomeRange, levels = c("$1-24,999","$25,000-49,999","$50,000-74,999","$75,000-99,999","$100,000+"))
ggplot(data = subset(est,!is.na(IncomeRange)), aes(y = BorrowerAPR, x = LoanOriginalAmount)) + geom_point(aes(color = factor(OrangeJuiceScore)),alpha = 0.88) + scale_color_brewer(type = "div") + facet_wrap(~IncomeRange) 

There are a lot of users with an income of 100,000 or more. In general, poor OrangeJuice scores result in higher interest rates and good OrangeJuice scores give you lower rates. People who have received large loan amounts tend to have a high income, a great OrangeJuice score, and a low interest rate.

ggplot(data = loan_data, aes(y = BorrowerAPR, x = Investors)) +  geom_point(aes(color = LoanOriginalAmount)) + facet_wrap(~OrangeJuiceScore) + scale_color_gradient(low = "black", high = "purple",space = "Lab", guide = "colourbar")

We look at borrower APR compared to the number of investors for each OrangeJuice score. The higher OrangeJuice scores have received larger loan amounts. Interest rates decrease as the number of investors increases. There is also a chunk of the data that has no OrangeJuice score. It follows the same pattern: more investors go with a better interest rate and a larger loan amount.

ggplot(data = loan_data, aes(x = Investors, y = EstimatedLoss)) + geom_point(aes(color = LenderYield)) + scale_color_gradient(low = "darkmagenta", high = "seagreen1",space = "Lab", guide = "colourbar")

Estimated loss decreases as the number of investors increases. Yield seems to be higher when there are fewer investors. Loans with higher yields and few investors are estimated to have higher losses.

Conclusion

After this long analysis of the different variables in different forms, a story does emerge. OrangeJuice mainly lends smaller amounts to prospective borrowers. Its client base is concentrated in California, but it caters to people all across the United States. OrangeJuice has a two-way system of borrowing and lending, and each side could be interpreted in its own way. OrangeJuice has clients in many different occupations, but most of them have kept this confidential and listed themselves as "others". The mechanism behind the lending service seems to depend heavily on the user's history. I was unable to do a dynamic analysis that factors in time, but it appears that as you use the service and work on your OrangeJuiceScore, you get better deals. A good way to get a step ahead would be to have a good credit score, since it influences the rate and the loan amount received. The OrangeJuice score seems to be one of the most important factors for a loan. It helps to have a high annual income as well. The number of investors also affects the rate received by the borrower. A geographical influence on loans is present, but no firm conclusion about it can be drawn from the amount of data in the data set. Lastly, I would like to note that a lot of the variables are highly correlated with one another. They could be part of a larger financial function. Many other types of analysis could be made on the population data set. Overall it was a good learning experience.

Further Investigations

My hope is that the findings from this analysis can be transferred into a linear model. With the information here and new information from the population data set, we can create predictive models. Because there are too many variables, the best way would be to split the population data set into smaller segments that suit your research topic.
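As a starting point for such a model, the sketch below fits a linear regression of APR on two of the variables discussed above. Everything in it is illustrative: `toy_loans` is simulated, the 1 to 11 score range is an assumption, and the coefficients used to generate the data are made up, so the output says nothing about the real OrangeJuice data.

```r
# Simulated stand-in for loan_data; every number here is invented for illustration.
set.seed(42)
n <- 200
toy_loans <- data.frame(
  OrangeJuiceScore   = sample(1:11, n, replace = TRUE),  # assumed score range
  LoanOriginalAmount = round(runif(n, 1000, 35000))
)

# APR falls with score and loan amount in this simulation, plus a little noise.
toy_loans$BorrowerAPR <- 0.35 - 0.02 * toy_loans$OrangeJuiceScore -
  1e-06 * toy_loans$LoanOriginalAmount + rnorm(n, sd = 0.01)

# Ordinary least squares fit of APR on the two predictors.
fit <- lm(BorrowerAPR ~ OrangeJuiceScore + LoanOriginalAmount, data = toy_loans)
print(summary(fit)$coefficients)
```

On the real data, the highly correlated rate and yield variables noted earlier would probably need to be left out of the predictors, and categorical variables such as income range or employment status could be added as factors.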
